# Required packages and libraries
library(tidyverse)
library(lubridate)
library(fpp2)
library(fpp3)
library(tsibble)
library(tidymodels)
library(modeltime)
library(modeltime.ensemble) # ensembles
library(timetk)
library(ggplot2)
library(ggmap)
library(fable)
library(feasts)
library(fable.prophet)
# to use the tscv package, first run the two commented lines below.
# install.packages("devtools")
# devtools::install_github("ahaeusser/tscv")
library(tscv)
First dataset - Traffic Intensity
A detailed description of each variable in the data can be found in the visual below:
The trafficMAD data set contains day-ahead traffic data for the city of Madrid in January 2022. It holds 15-minute time series observations from 2022-01-01 to 2022-01-31 for 368 measurement zones (id).
setwd("C:/UC3M Master Lectures/TERM 3/Time Series/PART2/Final")
# Around 4000 points in Madrid where traffic is monitored at minute resolution
trafficMAD <- read.delim("01-2022-traffic.csv", na.strings = "NaN", sep=",")
head(trafficMAD,5)
## id long_date elem_type intensity occupation charge vmed error
## 1 1001 01/01/2022 00:00 M30 408 1 0 61 N
## 2 1001 01/01/2022 00:15 M30 156 0 0 69 N
## 3 1001 01/01/2022 00:30 M30 192 NA 0 76 N
## 4 1001 01/01/2022 00:45 M30 600 2 0 67 N
## 5 1001 01/01/2022 01:00 M30 1065 2 0 65 N
## integration_period
## 1 5
## 2 5
## 3 5
## 4 5
## 5 4
The 368 zones (id) of the time series:
trafficMAD %>%
group_by(id) %>% summarise(Unique_IDs = n_distinct(id))
## # A tibble: 368 x 2
## id Unique_IDs
## <int> <int>
## 1 1001 1
## 2 1002 1
## 3 1003 1
## 4 1006 1
## 5 1009 1
## 6 1010 1
## 7 1011 1
## 8 1012 1
## 9 1013 1
## 10 1014 1
## # ... with 358 more rows
Second dataset - Location of traffic measurement points
The data set locationMAD contains the location of each traffic measurement zone (each id in the first data set) in Madrid for January 2022. It shares the id and elem_type columns with the other data set. As shown below, it also stores the name of each id and the district it belongs to. Moreover, the utm_x and utm_y columns are the x and y coordinates of the centroid of the polygon representing each measurement point, and the last two columns, longitude and latitude, give its geographic position.
# adding station information
locationMAD = read.delim('pmed_ubicacion_01-2022.csv', sep=",")
locationMAD$id = as.factor(locationMAD$id)
head(locationMAD,5)
## elem_type district id cod_cent
## 1 URB 4 3840 1001
## 2 URB 4 3841 1002
## 3 URB 1 3842 1003
## 4 URB 4 3843 1004
## 5 URB 4 3844 1005
## name utm_x
## 1 Jose Ortega y Gasset E-O - Pº Castellana-Serrano 441615.3
## 2 Jose Ortega y Gasset O-E - Serrano-Pº Castellana 441705.9
## 3 Pº Recoletos N-S - Almirante-Prim 441319.4
## 4 Pº Recoletos S-N - Pl. Cibeles- Recoletos 441301.6
## 5 (AFOROS) Pº Castellana S-N - Eduardo Dato - Pl.Emilio Castelar 441605.8
## utm_y longitude latitude
## 1 4475768 -3.688323 40.43050
## 2 4475770 -3.687256 40.43052
## 3 4474841 -3.691727 40.42213
## 4 4474764 -3.691929 40.42143
## 5 4476132 -3.688470 40.43378
In the data preprocessing section, these two data sets will be merged to attach the location information to each id. The district with the highest traffic intensity will then be identified and analyzed.
As mentioned before, the data has observations every 15 minutes, so the time series has a high frequency and is quite complex. To reduce the complexity, the 15-minute observations will be aggregated to hourly values, which makes the series smoother.
Before applying any forecasting method, the data has to be prepared: changing the data type of some columns and adding columns for the required time features, such as the hour, the day of the week (e.g. "Monday", "Sunday") and the week number within the month.
## Changing the type of the columns
trafficMAD$id = as.factor(trafficMAD$id)
trafficMAD$elem_type = as.factor(trafficMAD$elem_type)
## Dates
trafficMAD$day = format(strptime(trafficMAD$long_date,format='%d/%m/%Y %H:%M'),'%Y-%m-%d %H')
trafficMAD$hourly = hour(strptime(trafficMAD$long_date,format='%d/%m/%Y %H:%M'))
trafficMAD$day_of_week = weekdays(strptime(trafficMAD$long_date,format='%d/%m/%Y %H:%M'))
trafficMAD$week_no = week(strptime(trafficMAD$long_date,format='%d/%m/%Y %H:%M'))
head(trafficMAD,6)
## id long_date elem_type intensity occupation charge vmed error
## 1 1001 01/01/2022 00:00 M30 408 1 0 61 N
## 2 1001 01/01/2022 00:15 M30 156 0 0 69 N
## 3 1001 01/01/2022 00:30 M30 192 NA 0 76 N
## 4 1001 01/01/2022 00:45 M30 600 2 0 67 N
## 5 1001 01/01/2022 01:00 M30 1065 2 0 65 N
## 6 1001 01/01/2022 01:15 M30 1704 5 0 66 N
## integration_period day hourly day_of_week week_no
## 1 5 2022-01-01 00 0 Saturday 1
## 2 5 2022-01-01 00 0 Saturday 1
## 3 5 2022-01-01 00 0 Saturday 1
## 4 5 2022-01-01 00 0 Saturday 1
## 5 4 2022-01-01 01 1 Saturday 1
## 6 5 2022-01-01 01 1 Saturday 1
After these manipulations of the first data set, the modified time series will be merged, via the id column, with the second data set, which holds the location information of each id.
trafficMAD = merge(trafficMAD, locationMAD, by="id")
head(trafficMAD,5)
## id long_date elem_type.x intensity occupation charge vmed error
## 1 1001 01/01/2022 00:00 M30 408 1 0 61 N
## 2 1001 01/01/2022 00:15 M30 156 0 0 69 N
## 3 1001 01/01/2022 00:30 M30 192 NA 0 76 N
## 4 1001 01/01/2022 00:45 M30 600 2 0 67 N
## 5 1001 01/01/2022 01:00 M30 1065 2 0 65 N
## integration_period day hourly day_of_week week_no elem_type.y
## 1 5 2022-01-01 00 0 Saturday 1 M30
## 2 5 2022-01-01 00 0 Saturday 1 M30
## 3 5 2022-01-01 00 0 Saturday 1 M30
## 4 5 2022-01-01 00 0 Saturday 1 M30
## 5 4 2022-01-01 01 1 Saturday 1 M30
## district cod_cent name utm_x utm_y longitude latitude
## 1 10 05FT10PM01 05FT10PM01 437146 4473498 -3.740786 40.40973
## 2 10 05FT10PM01 05FT10PM01 437146 4473498 -3.740786 40.40973
## 3 10 05FT10PM01 05FT10PM01 437146 4473498 -3.740786 40.40973
## 4 10 05FT10PM01 05FT10PM01 437146 4473498 -3.740786 40.40973
## 5 10 05FT10PM01 05FT10PM01 437146 4473498 -3.740786 40.40973
Sorting the intensity for each district
After attaching the location information of each id, the district information will be used to find the area of Madrid with the highest traffic intensity. As mentioned before, a district contains multiple ids and represents an area of the city. Therefore, it is interesting to do time series forecasting for the district with the highest traffic intensity of all.
## Defining the district column as factor
trafficMAD$district = as.factor(trafficMAD$district)
## Aggregating the 15-minute observations to hourly values per district
trafficMAD %>% drop_na(occupation) %>%
group_by(district,day) %>%
summarise(intensity=mean(intensity, na.rm = FALSE)) %>%
mutate(day=as.POSIXct(day, format='%Y-%m-%d %H')) -> trafficMAD_area
## `summarise()` has grouped output by 'district'. You can override using the
## `.groups` argument.
trafficMAD_area %>% arrange(desc(intensity))
## # A tibble: 15,623 x 3
## # Groups: district [21]
## district day intensity
## <fct> <dttm> <dbl>
## 1 14 2022-01-21 15:00:00 2300.
## 2 14 2022-01-14 15:00:00 2234.
## 3 14 2022-01-28 15:00:00 2198
## 4 14 2022-01-21 14:00:00 2175.
## 5 14 2022-01-28 14:00:00 2137.
## 6 14 2022-01-14 14:00:00 2057.
## 7 10 2022-01-03 08:00:00 2041.
## 8 14 2022-01-24 14:00:00 2023.
## 9 14 2022-01-20 14:00:00 2006.
## 10 14 2022-01-20 15:00:00 2006
## # ... with 15,613 more rows
After sorting the intensity in descending order, it can be seen that the highest traffic occurred in district 14, even though there are some high intensity values in other districts as well. Therefore, this district is a good candidate for analysis.
The intensity of each district can also be visualized as below to see the hourly intensity clearly.
p <- ggplot(data = trafficMAD_area,
aes(x = day, y = intensity, color = district)) +
geom_line() + ggtitle("Traffic Intensity in each district",
subtitle = "2022-01-01 to 2022-01-31")
p + facet_wrap(~district)
It is clear that district 14 has the highest traffic intensity in January 2022.
Choosing the highest intensity area in Madrid to analyze
Here, district 14 will be filtered from the trafficMAD data.
district_id = 14 # district 14
trafficMAD %>%
filter(district==district_id) %>%
## converting the 'day' column to POSIXct to use it in group_by later
mutate(day=as.POSIXct(day, format='%Y-%m-%d %H')) -> trafficMAD_14
head(trafficMAD_14,5)
## id long_date elem_type.x intensity occupation charge vmed error
## 1 3495 01/01/2022 00:00 M30 204 0 5 93 N
## 2 3495 01/01/2022 00:15 M30 173 0 2 91 N
## 3 3495 01/01/2022 00:30 M30 492 0 9 84 N
## 4 3495 01/01/2022 00:45 M30 1060 1 22 81 N
## 5 3495 01/01/2022 01:00 M30 1256 2 25 81 N
## integration_period day hourly day_of_week week_no elem_type.y
## 1 15 2022-01-01 00:00:00 0 Saturday 1 M30
## 2 15 2022-01-01 00:00:00 0 Saturday 1 M30
## 3 15 2022-01-01 00:00:00 0 Saturday 1 M30
## 4 15 2022-01-01 00:00:00 0 Saturday 1 M30
## 5 15 2022-01-01 01:00:00 1 Saturday 1 M30
## district cod_cent name utm_x utm_y longitude latitude
## 1 14 PM30752 PM30752 444467.3 4474308 -3.654574 40.41754
## 2 14 PM30752 PM30752 444467.3 4474308 -3.654574 40.41754
## 3 14 PM30752 PM30752 444467.3 4474308 -3.654574 40.41754
## 4 14 PM30752 PM30752 444467.3 4474308 -3.654574 40.41754
## 5 14 PM30752 PM30752 444467.3 4474308 -3.654574 40.41754
The IDs in district 14
trafficMAD_14 %>%
group_by(id,name) %>% summarise(Unique_IDs = n_distinct(id))
## `summarise()` has grouped output by 'id'. You can override using the `.groups`
## argument.
## # A tibble: 2 x 3
## # Groups: id [2]
## id name Unique_IDs
## <fct> <chr> <int>
## 1 3495 PM30752 1
## 2 3561 PM30753 1
Plotting the ids on the Madrid map
Here, district 14 will be plotted using the longitude and latitude of ids 3495 and 3561.
ids = c(3495,3561)
locationMAD %>%
filter(id %in% ids) -> coordinates
coordinates
## elem_type district id cod_cent name utm_x utm_y longitude latitude
## 1 M30 14 3495 PM30752 PM30752 444467.3 4474308 -3.654574 40.41754
## 2 M30 14 3561 PM30753 PM30753 444465.8 4474300 -3.654592 40.41747
# getting the map
map14 <- get_map(location = c(longitude = mean(coordinates$longitude), latitude = mean(coordinates$latitude)), zoom = 12,
maptype = "roadmap", scale = 2)
# plotting the map with some points on it
ggmap(map14) +
geom_point(data = coordinates, aes(x = longitude, y = latitude),
fill = "darkred", alpha = 0.8, size = 8, shape = 21)
After getting the distinct IDs in district 14, it can be seen that there are two ids, 3495 and 3561. Therefore, I am going to analyze these measurement points.
Aggregating the minute frequency into hour
After filtering the busiest area of Madrid, district 14 with ids 3495 and 3561, the frequency will be changed from 15 minutes to one hour to make the time series smoother. The result is an "hourly profile" of the entire month with seasonal period 24, since there are 24 hours in a day.
## Aggregating the minute frequency into hour
trafficMAD_14 %>% drop_na(occupation,vmed) %>%
group_by(district,day) %>%
summarise(occupation=median(occupation, na.rm = FALSE),
intensity=mean(intensity, na.rm = FALSE),
vmed=mean(vmed, na.rm = FALSE)) -> trafficMAD_14_hr
## `summarise()` has grouped output by 'district'. You can override using the
## `.groups` argument.
trafficMAD_14_hr
## # A tibble: 744 x 5
## # Groups: district [1]
## district day occupation intensity vmed
## <fct> <dttm> <dbl> <dbl> <dbl>
## 1 14 2022-01-01 00:00:00 0 288. 67.1
## 2 14 2022-01-01 01:00:00 1.5 863. 87
## 3 14 2022-01-01 02:00:00 1.5 696. 81.1
## 4 14 2022-01-01 03:00:00 0.5 421. 67
## 5 14 2022-01-01 04:00:00 0 288. 67
## 6 14 2022-01-01 05:00:00 0.5 258. 57.4
## 7 14 2022-01-01 06:00:00 0.5 340 61.8
## 8 14 2022-01-01 07:00:00 0.5 381. 74.4
## 9 14 2022-01-01 08:00:00 0 342. 77.6
## 10 14 2022-01-01 09:00:00 0 289. 68.9
## # ... with 734 more rows
After obtaining the hourly multiple time series, three variables of interest (intensity, occupation and vmed) will be plotted to inspect their series.
trafficMAD_14_hr %>%
pivot_longer(c(3:5), names_to = "var", values_to = "value") %>%
ggplot(aes(x = day, y = value, group = 1)) +
geom_line() +
facet_grid(vars(var),scales = "free") +
labs(title = "Madrid Traffic in district 14 (in hour)",
y = "values")
As seen in the time series above, the hourly mean of intensity and the hourly median of occupation behave almost identically. This makes sense because occupation is an indicator of traffic intensity in the data. In the VAR model, these three variables will be analyzed in detail.
The vector autoregressive (VAR) model is a multivariate time series model that relates the current observations of a variable to its own past observations and to past observations of the other variables. Therefore, in this part, I am going to analyze the three variables in the data.
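In general form, writing the three series as the vector $y_t = (\text{occupation}_t,\ \text{intensity}_t,\ \text{vmed}_t)'$, a VAR(p) model is:

```latex
y_t = c + A_1 y_{t-1} + A_2 y_{t-2} + \cdots + A_p y_{t-p} + \varepsilon_t
```

where each $A_i$ is a 3x3 coefficient matrix, $c$ is a constant vector and $\varepsilon_t$ is multivariate white noise.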
Converting to a tsibble time series with the as_tsibble function
To do that, the tibble data must be converted to tsibble format.
#trafficMAD_14_hr$occupation <- NULL
## Converting to a tsibble
traffic_tsbl = as_tsibble(trafficMAD_14_hr, index = day)
class(traffic_tsbl)
## [1] "grouped_ts" "grouped_df" "tbl_ts" "tbl_df" "tbl"
## [6] "data.frame"
traffic_tsbl
## # A tsibble: 744 x 5 [1h] <?>
## # Groups: district [1]
## district day occupation intensity vmed
## <fct> <dttm> <dbl> <dbl> <dbl>
## 1 14 2022-01-01 00:00:00 0 288. 67.1
## 2 14 2022-01-01 01:00:00 1.5 863. 87
## 3 14 2022-01-01 02:00:00 1.5 696. 81.1
## 4 14 2022-01-01 03:00:00 0.5 421. 67
## 5 14 2022-01-01 04:00:00 0 288. 67
## 6 14 2022-01-01 05:00:00 0.5 258. 57.4
## 7 14 2022-01-01 06:00:00 0.5 340 61.8
## 8 14 2022-01-01 07:00:00 0.5 381. 74.4
## 9 14 2022-01-01 08:00:00 0 342. 77.6
## 10 14 2022-01-01 09:00:00 0 289. 68.9
## # ... with 734 more rows
Here, the training data will be filtered to the first 29 days, leaving the last 2 of the 31 days for evaluation.
model_var <- traffic_tsbl %>%
filter(day(day)<30) %>%
model(aicc=VAR(vars(occupation,intensity,vmed), season=24),
bic=VAR(vars(occupation,intensity,vmed), ic="bic",season=24)
)
glance(model_var)
## # A tibble: 2 x 6
## .model sigma2 log_lik AIC AICc BIC
## <chr> <list> <dbl> <dbl> <dbl> <dbl>
## 1 aicc <dbl [3 x 3]> -7109. 14332. 14342. 14590.
## 2 bic <dbl [3 x 3]> -7132. 14361. 14368. 14579.
When comparing the Bayesian Information Criterion (BIC) with Akaike's Information Criterion (AIC), the BIC penalizes additional free parameters more strongly than the AIC. This parsimony is desirable here, for instance because occupation is quite similar to intensity. As the BIC score of the bic model is lower than that of the aicc model, the bic model will be analyzed.
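As a quick sanity check, both criteria can be recomputed from the log-likelihood reported by glance(); note that the parameter count k and the effective sample size n below are inferred from the output, not read from the model object:

```r
# AIC = -2*logLik + 2k,  BIC = -2*logLik + k*log(n)
loglik <- -7132.38   # log-likelihood of the bic model (from glance above)
k      <- 48         # inferred number of estimated parameters
n      <- 692        # inferred effective sample size (29 days x 24 h minus 4 lags)
-2 * loglik + 2 * k        # AIC  ~ 14360.8
-2 * loglik + k * log(n)   # BIC  ~ 14578.7
```

These match the AIC and BIC printed in the report() output below to within rounding.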
report(model_var$bic[[1]])
## Series: occupation, intensity, vmed
## Model: VAR(4) w/ mean
##
## Coefficients for occupation:
## lag(occupation,1) lag(intensity,1) lag(vmed,1) lag(occupation,2)
## -0.0097 0.0049 -0.0159 0.2695
## s.e. 0.0697 0.0003 0.0040 0.0705
## lag(intensity,2) lag(vmed,2) lag(occupation,3) lag(intensity,3)
## -0.0029 0.0115 0.0104 0.0011
## s.e. 0.0003 0.0050 0.0716 0.0004
## lag(vmed,3) lag(occupation,4) lag(intensity,4) lag(vmed,4) constant
## -0.0106 0.0696 -0.0011 0.0044 0.4861
## s.e. 0.0049 0.0709 0.0003 0.0039 0.2086
##
## Coefficients for intensity:
## lag(occupation,1) lag(intensity,1) lag(vmed,1) lag(occupation,2)
## -41.0141 1.4622 -1.0922 81.0893
## s.e. 18.9054 0.0756 1.0846 19.1342
## lag(intensity,2) lag(vmed,2) lag(occupation,3) lag(intensity,3)
## -0.7903 2.6844 32.5909 0.1747
## s.e. 0.0937 1.3438 19.4187 0.0978
## lag(vmed,3) lag(occupation,4) lag(intensity,4) lag(vmed,4) constant
## -2.4583 25.5861 -0.3123 0.4117 234.5879
## s.e. 1.3383 19.2244 0.0887 1.0545 56.5754
##
## Coefficients for vmed:
## lag(occupation,1) lag(intensity,1) lag(vmed,1) lag(occupation,2)
## -2.7913 0.0206 0.7123 0.7374
## s.e. 0.7511 0.0030 0.0431 0.7602
## lag(intensity,2) lag(vmed,2) lag(occupation,3) lag(intensity,3)
## -0.0040 -0.0340 2.1619 -0.0033
## s.e. 0.0037 0.0534 0.7715 0.0039
## lag(vmed,3) lag(occupation,4) lag(intensity,4) lag(vmed,4) constant
## -0.0131 -0.3487 -0.0054 -0.1294 30.3338
## s.e. 0.0532 0.7637 0.0035 0.0419 2.2476
##
## Residual covariance matrix:
## occupation intensity vmed
## occupation 0.4549 102.1403 1.3679
## intensity 102.1403 33466.0145 614.2559
## vmed 1.3679 614.2559 52.8196
##
## log likelihood = -7132.38
## AIC = 14360.76 AICc = 14368.07 BIC = 14578.66
Let’s analyze the residuals. Here, .innov is used, which contains the innovation residuals. In this case, as no transformation has been used, the innovation residuals are identical to the regular residuals.
Residuals are useful for checking whether a model has adequately captured the information in the data. For this reason, the innovation residuals will be checked with ACF(.innov).
model_var %>%
select(bic) %>%
augment() %>%
ACF(.innov) %>%
autoplot()
These three ACF plots show that the mean of the residuals is close to zero and that most autocorrelations in the residual series are not significant; the variation of the residuals stays much the same across the historical data. However, there are clear spikes at lag 24, which indicate a remaining daily seasonal pattern with period 24 hours.
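The visual ACF check can be complemented with a portmanteau test; a sketch using the ljung_box feature from feasts, with the lag set to 24 to match the daily period:

```r
# Ljung-Box test on the innovation residuals of the bic model;
# small p-values indicate remaining autocorrelation
model_var %>%
  select(bic) %>%
  augment() %>%
  features(.innov, ljung_box, lag = 24)
```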
Here, the following two days (24+24=48 hours) will be estimated.
model_var%>%
select(bic) %>%
forecast(h=48) %>%
autoplot(traffic_tsbl, size=2)
## `mutate_if()` ignored the following grouping variables:
## * Column `district`
As seen in the results above, intensity and occupation are almost the same. As the aim is to predict traffic congestion, the response variable should be intensity. In this case, the traffic intensity will be predicted first, and then a dynamic regression model will be applied to analyze the two most significant variables, vmed and intensity.
In this part, the traffic intensity in the busiest area of Madrid will first be predicted with a manually specified seasonal ARIMA model.
# Traffic congestion in the 14th area
traffic_tsbl %>% ggplot(aes(x=day , y=intensity)) + geom_line(col="deepskyblue2",size=1.2) +
labs(title = 'Monthly Traffic Congestion in the district-14', subtitle="January, 2022", x = '', y = 'intensity') + theme_minimal()
As seen in the plot above, the traffic intensity in the busiest area of Madrid tends to increase overall from the beginning to the end of January.
Residuals of Intensity variable
traffic_tsbl %>% gg_tsdisplay(y=intensity, plot_type = "partial", lag=24)
As seen above, the ACF/PACF plots of the time series provide some insight into possible models. They indicate that a model with seasonal period 24 may be suitable, because the series shows clear seasonality and is not stationary. Therefore, it would be good to fit a seasonal ARIMA model. To choose the best one, the PACF is inspected to select the non-seasonal part of the model and the ACF to select the seasonal part. Based on these plots, ARIMA(1,0,1)(2,1,0) would be a good model for this data. Later, this ARIMA model will be tested against an automatic ARIMA to see whether it is the best or not.
Seasonal ARIMA Model - Manual
model_arima_manuel <- traffic_tsbl %>%
filter(day(day)<30) %>%
model(ARIMA(intensity ~ pdq(p=1, d=0, q=1) + PDQ(P=2, D=1, Q=0)))
report(model_arima_manuel)
## Series: intensity
## Model: ARIMA(1,0,1)(2,1,0)[24]
##
## Coefficients:
## ar1 ma1 sar1 sar2
## 0.8020 0.1798 -0.4142 -0.2633
## s.e. 0.0267 0.0445 0.0396 0.0388
##
## sigma^2 estimated as 18205: log likelihood=-4251.26
## AIC=8512.51 AICc=8512.6 BIC=8535.06
The AIC is about 8512 and the BIC 8535. Let's analyze the residuals of the fitted ARIMA model.
model_arima_manuel %>% gg_tsresiduals(lag=24)
As seen above, the residuals of the fitted model look stationary, which suggests that this model is a good fit for this data set.
Forecasting with the manual ARIMA
Here, the next 48 hours (2 days) will be predicted.
fc = model_arima_manuel %>% forecast(h = 48)
fc %>%
autoplot(traffic_tsbl, size=1.5) +
labs(title = "Traffic Intensity",subtitle='Hourly Forecasts for the 2 days ahead', x='', y = '') + theme_minimal()
## `mutate_if()` ignored the following grouping variables:
## * Column `district`
Evaluation Metrics for the Test set
estimations <- forecast(model_arima_manuel,h = 48)
test <- trafficMAD_14_hr%>%
filter(day(day)>=30)
rmse <- function (x, y)
{
RMSE <- sqrt(mean((y - x)^2))
return(RMSE)
}
glue::glue('RMSE: ', rmse(estimations$.mean,test$intensity))
## RMSE: 223.77997580009
The RMSE on the test set is about 223. To see other evaluation metrics, the accuracy function from the forecast package will be used.
forecast::accuracy(estimations$.mean,test$intensity)
## ME RMSE MAE MPE MAPE
## Test set -44.62964 223.78 187.7288 -23.18167 35.44874
For the ARIMA(1,0,1)(2,1,0)[24] model, the test RMSE is about 223 and the MAE about 187. To see whether this model is the best, an automatic ARIMA will be implemented in the automatic forecasting part.
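The same test-set metrics can also be obtained directly on the fable object with fabletools, by passing the full tsibble (which contains the last two days) as the comparison data; a sketch:

```r
# accuracy() matches each forecast in fc to the actuals in traffic_tsbl
fabletools::accuracy(fc, traffic_tsbl)
```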
Automatic Forecasting Models
In this part, arima, ets, tslm and prophet models will be tried. The manually specified seasonal arima model will also be added to see whether it is the best choice or not. The models will be trained and tested using time series cross-validation in sliding mode to assess and improve model performance.
# Seasonal plot
traffic_tsbl %>%
plot_time_series(.date_var=day, .value=intensity, .interactive = TRUE,
.title='Traffic intensity in Madrid for the district 14 in January, 2022',
.y_lab = "Monthly data in hour")
Seasonal decomposition with timetk:
traffic_tsbl %>%
plot_stl_diagnostics(day, intensity,
.frequency = "auto", .trend = "auto",
.feature_set = c("observed", "season", "trend", "remainder"),
.interactive = T)
In the STL diagnostics, the hourly seasonality can be seen clearly. Also, as mentioned before, the hourly traffic intensity series shows an upward trend until the end of the month.
Time-series cross-validation
To prepare the data set for time series cross-validation, the tscv package will be used. Its make_split() function splits the data into several slices for training and testing. Two window modes are available for the splits, stretch and slide: the first is an expanding window approach, while the latter is a fixed window approach. For this project, the slide mode will be used, because real time series are usually non-stationary, so it is often better to use a fixed training window than an expanding one.
The training window will contain 600 observations, and in each testing window 24 hours will be predicted. As it is an hourly time series, n_lag is 0 and the step size between increments (n_skip) is set to 23.
series_id = "district"
value_id = "intensity"
index_id = "day"
context <- list(
series_id = series_id,
value_id = value_id,
index_id = index_id
)
# Setup for time series cross validation
type = "first"
value = 600 # size for training window
n_ahead = 24 # size for testing window (= forecast horizon)
n_skip = 23 # skip 23 observations because it is an hourly data
n_lag = 0 # no lag
mode = "slide" # fixed window approach
exceed = FALSE
split_frame <- make_split(
main_frame = trafficMAD_14_hr,
context = context,
type = type,
value = value,
n_ahead = n_ahead,
n_skip = n_skip,
n_lag = n_lag,
mode = mode,
exceed = exceed
)
split_frame
## # A tibble: 6 x 4
## district split train test
## <fct> <int> <list> <list>
## 1 14 1 <int [600]> <int [24]>
## 2 14 2 <int [600]> <int [24]>
## 3 14 3 <int [600]> <int [24]>
## 4 14 4 <int [600]> <int [24]>
## 5 14 5 <int [600]> <int [24]>
## 6 14 6 <int [600]> <int [24]>
With these settings, the data set has been split into 6 folds.
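The number of folds follows from the settings above; assuming the window slides by n_skip + 1 observations per split, it can be checked as:

```r
n_obs <- nrow(trafficMAD_14_hr)                  # 744 hourly observations
(n_obs - value - n_ahead) %/% (n_skip + 1) + 1   # (744 - 600 - 24) / 24 + 1 = 6
```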
From these splits, training and test data frames will be created with the slice_train and slice_test functions: split_frame only stores the row indices, and these functions extract the corresponding observations from the main trafficMAD_14_hr data.
# Slice training data from main_frame according to split_frame
train_frame <- slice_train(
main_frame = trafficMAD_14_hr,
split_frame = split_frame,
context = context
)
head(train_frame,5)
## # A tibble: 5 x 6
## # Groups: district [1]
## district day occupation intensity vmed split
## <fct> <dttm> <dbl> <dbl> <dbl> <int>
## 1 14 2022-01-01 00:00:00 0 288. 67.1 1
## 2 14 2022-01-01 01:00:00 1.5 863. 87 1
## 3 14 2022-01-01 02:00:00 1.5 696. 81.1 1
## 4 14 2022-01-01 03:00:00 0.5 421. 67 1
## 5 14 2022-01-01 04:00:00 0 288. 67 1
After creating the train_frame tibble, it should be converted to tsibble format.
# Convert tibble to tsibble
train_frame_tsbl <- train_frame %>%
as_tsibble(key = c(district, split),index = day)
head(train_frame_tsbl,5)
## # A tsibble: 5 x 6 [1h] <?>
## # Key: district, split [1]
## # Groups: district [1]
## district day occupation intensity vmed split
## <fct> <dttm> <dbl> <dbl> <dbl> <int>
## 1 14 2022-01-01 00:00:00 0 288. 67.1 1
## 2 14 2022-01-01 01:00:00 1.5 863. 87 1
## 3 14 2022-01-01 02:00:00 1.5 696. 81.1 1
## 4 14 2022-01-01 03:00:00 0.5 421. 67 1
## 5 14 2022-01-01 04:00:00 0 288. 67 1
Every step used for creating the train_frame is repeated to define the test_frame.
# Slice test data from main_frame according to split_frame
test_frame <- slice_test(
main_frame = trafficMAD_14_hr,
split_frame = split_frame,
context = context
)
test_frame_tsbl <- test_frame %>% as_tsibble(key = c(district, split),index = day)
head(test_frame_tsbl,5)
## # A tsibble: 5 x 6 [1h] <?>
## # Key: district, split [1]
## # Groups: district [1]
## district day occupation intensity vmed split
## <fct> <dttm> <dbl> <dbl> <dbl> <int>
## 1 14 2022-01-26 00:00:00 1 328. 64 1
## 2 14 2022-01-26 01:00:00 0 156. 49.1 1
## 3 14 2022-01-26 02:00:00 0 99 41.4 1
## 4 14 2022-01-26 03:00:00 0 95.6 39.4 1
## 5 14 2022-01-26 04:00:00 0 125. 44.1 1
Forecasting Fable auto models with time series cross validation
model_fable_auto <- train_frame_tsbl %>%
model(arima_manuel = ARIMA(intensity ~ pdq(p=1, d=0, q=1) + PDQ(P=2, D=1, Q=0)),
arima = ARIMA(intensity),
ets = ETS(intensity ~ season(c("A","M"), period=24)),
lm = TSLM(intensity ~ I(day(day)^2) + season(period=24)),
prophet = prophet(intensity ~ season(period = 24))
)
Training Accuracy
fabletools::accuracy(model_fable_auto)
## # A tibble: 30 x 12
## district split .model .type ME RMSE MAE MPE MAPE MASE RMSSE
## <fct> <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 14 1 arima_ma~ Trai~ 4.65e+ 0 135. 85.3 -11.5 23.0 0.484 0.492
## 2 14 1 arima Trai~ 4.65e+ 0 135. 85.3 -11.5 23.0 0.484 0.492
## 3 14 1 ets Trai~ 2.67e- 1 127. 88.5 -9.76 22.3 0.502 0.463
## 4 14 1 lm Trai~ -1.99e-14 234. 181. -26.2 45.3 1.02 0.850
## 5 14 1 prophet Trai~ -2.35e- 1 237. 183. -27.7 48.8 1.04 0.862
## 6 14 2 arima_ma~ Trai~ 4.20e+ 0 131. 82.5 -11.3 21.7 0.491 0.486
## 7 14 2 arima Trai~ 4.20e+ 0 131. 82.5 -11.3 21.7 0.491 0.486
## 8 14 2 ets Trai~ 1.44e- 1 122. 84.8 -9.68 21.5 0.504 0.453
## 9 14 2 lm Trai~ -5.21e-15 216. 166. -24.1 42.8 0.984 0.803
## 10 14 2 prophet Trai~ 7.95e- 2 224. 171. -25.5 45.9 1.02 0.835
## # ... with 20 more rows, and 1 more variable: ACF1 <dbl>
fabletools::accuracy(model_fable_auto) %>%
plot_line(
x = split,
y = MAE,
facet_var = .type,
facet_nrow = 1,
color = .model,
title = "Evaluation of forecast accuracy by split",
subtitle = "Mean absolute error (MAE)",
xlab = "Split"
)
As seen from the training accuracies, the lowest training error is achieved by the arima and arima_manuel models. As their error rates are identical, the ARIMA chosen automatically must be the same as the manually defined one (this will be analyzed in detail in the following part). The ets model comes second, while the lm and prophet models do not perform well in any split.
Therefore, let’s analyze the best two models, arima and ets.
model_fable_auto$arima
## <lst_mdl[6]>
## [1] <ARIMA(1,0,1)(2,1,0)[24]> <ARIMA(1,0,1)(2,1,0)[24]>
## [3] <ARIMA(1,0,1)(2,1,0)[24]> <ARIMA(1,0,1)(2,1,0)[24]>
## [5] <ARIMA(2,0,0)(1,1,0)[24]> <ARIMA(1,0,1)(2,1,0)[24]>
As seen above, only in the 5th split was a different model chosen, ARIMA(2,0,0)(1,1,0)[24]. For the remaining splits, ARIMA(1,0,1)(2,1,0)[24] was selected, which is the same as the manually defined seasonal ARIMA model.
Let’s analyze the residuals of the training split with the lowest MAE, which is split 4.
model_fable_auto %>% select(arima) %>% filter(split==4) %>% gg_tsresiduals(lag=24)
The residuals look stationary.
model_fable_auto$ets
## <lst_mdl[6]>
## [1] <ETS(A,N,A)> <ETS(A,N,A)> <ETS(A,N,A)> <ETS(A,N,A)> <ETS(A,N,A)>
## [6] <ETS(A,N,A)>
As seen above, the ETS(A,N,A) model was chosen for all the splits, so the ETS(A,N,A) model will be fitted to the data. An ETS model combines a "level", a "trend" (slope) and a "seasonal" component to describe a time series. In this case, the chosen model has Additive Errors, No Trend and Additive Seasonality.
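For reference, the automatically selected ETS(A,N,A) can be written out explicitly in fable's formula interface (a sketch, equivalent to what ETS() chose above; the model name ets_ana is illustrative):

```r
train_frame_tsbl %>%
  filter(split == 4) %>%
  model(ets_ana = ETS(intensity ~ error("A") + trend("N") + season("A", period = 24)))
```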
Let’s analyze the residuals of the training split with the lowest MAE, which is split 4.
model_fable_auto %>% select(ets) %>% filter(split==4) %>% gg_tsresiduals(lag=24)
The ACF plot shows that the residuals of the ARIMA fit look more stationary than those of the ETS fit.
Forecasting
# Forecasting
fable_frame <- model_fable_auto %>%
forecast(h = n_ahead) ## estimating the next 24 hours
fable_frame
## # A fable: 720 x 6 [1h] <?>
## # Key: district, split, .model [30]
## district split .model day intensity .mean
## <fct> <int> <chr> <dttm> <dist> <dbl>
## 1 14 1 arima_manuel 2022-01-26 00:00:00 N(377, 19176) 377.
## 2 14 1 arima_manuel 2022-01-26 01:00:00 N(222, 36732) 222.
## 3 14 1 arima_manuel 2022-01-26 02:00:00 N(148, 47981) 148.
## 4 14 1 arima_manuel 2022-01-26 03:00:00 N(140, 55188) 140.
## 5 14 1 arima_manuel 2022-01-26 04:00:00 N(177, 59807) 177.
## 6 14 1 arima_manuel 2022-01-26 05:00:00 N(253, 62766) 253.
## 7 14 1 arima_manuel 2022-01-26 06:00:00 N(538, 64662) 538.
## 8 14 1 arima_manuel 2022-01-26 07:00:00 N(719, 65877) 719.
## 9 14 1 arima_manuel 2022-01-26 08:00:00 N(1030, 66655) 1030.
## 10 14 1 arima_manuel 2022-01-26 09:00:00 N(976, 67154) 976.
## # ... with 710 more rows
# Converting fable_frame to future_frame (tibble)
future_frame <- make_future(
fable = fable_frame,
context = context
)
future_frame
## # A tibble: 720 x 6
## day district model split horizon point
## <dttm> <fct> <chr> <int> <int> <dbl>
## 1 2022-01-26 00:00:00 14 arima_manuel 1 1 377.
## 2 2022-01-26 01:00:00 14 arima_manuel 1 2 222.
## 3 2022-01-26 02:00:00 14 arima_manuel 1 3 148.
## 4 2022-01-26 03:00:00 14 arima_manuel 1 4 140.
## 5 2022-01-26 04:00:00 14 arima_manuel 1 5 177.
## 6 2022-01-26 05:00:00 14 arima_manuel 1 6 253.
## 7 2022-01-26 06:00:00 14 arima_manuel 1 7 538.
## 8 2022-01-26 07:00:00 14 arima_manuel 1 8 719.
## 9 2022-01-26 08:00:00 14 arima_manuel 1 9 1030.
## 10 2022-01-26 09:00:00 14 arima_manuel 1 10 976.
## # ... with 710 more rows
Estimating accuracy metrics by forecast horizon
In this part, the forecast errors will be plotted for each predicted point to see where in the horizon the models' predictions have the largest error.
accuracy_horizon <- make_accuracy(
future_frame = future_frame,
main_frame = trafficMAD_14_hr,
context = context,
dimension = "horizon"
)
# Visualize results
accuracy_horizon %>%
filter(metric == "MAE") %>%
plot_line(
x = n,
y = value,
facet_var = district,
facet_nrow = 1,
color = model,
title = "Evaluation of forecast accuracy by forecast horizon",
subtitle = "Mean absolute error (MAE)",
xlab = "Forecast horizon (n-step-ahead)"
)
In the plot above, the forecast error for each hour of the horizon is shown. In general, each model's forecast accuracy worsens when predicting the 5-10 hour and 15-20 hour periods. On the other hand, the beginning of the day (night time) is predicted with the smallest error by every model. Overall, the arima and ets models predict best.
Forecast accuracy by split
# Estimate accuracy metrics by split
accuracy_split <- make_accuracy(
future_frame = future_frame,
main_frame = trafficMAD_14_hr,
context = context,
dimension = "split"
)
# Visualize results
accuracy_split %>%
filter(metric == "MAE") %>%
plot_line(
x = n,
y = value,
facet_var = district,
facet_nrow = 1,
color = model,
title = "Evaluation of forecast accuracy by split",
subtitle = "Mean absolute error (MAE)",
xlab = "Split"
)
On the plot above, the MAE for each split is shown. The lowest MAE was obtained by both the auto arima and the manually defined arima models. Their results coincide because automatic selection chose the same seasonal ARIMA model; as a consequence, the red auto-arima line is only visible between splits 4 and 6, where a different model was selected for split 5.
Therefore, I am going to visualize the traffic intensity prediction for split 2, as it has the lowest error.
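The best split can also be selected programmatically instead of by eye. A small sketch, assuming the accuracy_split frame from the chunk above (columns metric, model, n and value, as used in the plot_line() call):

```r
# Pick the split with the lowest MAE for the auto arima model
# (the n argument of slice_min() is unrelated to the n column)
best_split <- accuracy_split %>%
  filter(metric == "MAE", model == "arima") %>%
  slice_min(value, n = 1) %>%
  pull(n)
best_split
```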
# Refit the seasonal ARIMA on the training data of split 2
model_arima <- train_frame_tsbl %>%
filter(split==2) %>%
model(arima = ARIMA(intensity))
model_arima$arima
## <lst_mdl[1]>
## [1] <ARIMA(1,0,1)(2,1,0)[24]>
Forecasting
Here, the hourly traffic intensity for the following two days will be predicted.
fc2 = model_arima %>% forecast(h = n_ahead)
fc2 %>%
autoplot(rbind(train_frame_tsbl %>% filter(split==2),test_frame_tsbl %>% filter(split==2)), size=1.5, level = NULL) +
labs(title = "Hourly Traffic Intensity",subtitle='2022-01-01 to 2022-01-31', x='', y = '') + theme_minimal()
## `mutate_if()` ignored the following grouping variables:
## * Column `district`
Prediction Intervals
pred_interval <- fc2 %>% hilo(level = c(80, 95))
pred_interval %>% filter(.model=="arima") -> pred_interval_arima
pred_interval_arima$`95%`
## <hilo[24]>
## [1] [ 90.17934, 614.3613]95 [-182.36791, 543.5951]95
## [3] [-289.85891, 539.1362]95 [-329.38761, 558.8603]95
## [5] [-311.06216, 612.8469]95 [-206.38509, 739.4750]95
## [7] [ 99.85046, 1059.3911]95 [ 379.47416, 1347.6031]95
## [9] [ 806.42970, 1779.9739]95 [ 629.79104, 1606.7589]95
## [11] [ 646.29471, 1625.4308]95 [ 736.21570, 1716.7263]95
## [13] [ 830.41085, 1811.7935]95 [1051.39926, 2033.3353]95
## [15] [1492.96042, 2475.2478]95 [1444.92294, 2427.4334]95
## [17] [ 953.75828, 1936.4103]95 [1221.77825, 2204.5203]95
## [19] [1385.84890, 2368.6480]95 [1299.50244, 2282.3378]95
## [21] [1067.94057, 2050.7990]95 [ 679.40410, 1662.2772]95
## [23] [ 370.22130, 1353.1037]95 [ 18.82449, 1001.7128]95
The intervals above are the 95% prediction intervals. The negative lower bounds are not meaningful, since traffic intensity cannot take a negative value. For instance, the second interval states that, with 95% confidence, the traffic intensity at 1 am will be between -182.37 and 543.60; similarly, the first interval states that, with 95% confidence, the intensity at hour 0 (midnight) will be between 90.18 and 614.36.
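Two common remedies for the negative lower bounds are sketched below, assuming fc2, pred_interval_arima, train_frame_tsbl and n_ahead from the chunks above are still in scope: either truncate the bounds at zero after the fact, or (usually preferable) refit the ARIMA on the log scale, in which case fable back-transforms both the point forecasts and the intervals onto the positive axis automatically.

```r
# Option 1: truncate negative lower bounds at zero after the fact
pred_interval_arima %>%
  mutate(lower_95 = pmax(0, `95%`$lower),
         upper_95 = `95%`$upper)

# Option 2: model log(intensity + 1); fable back-transforms the
# forecast distribution, so the intervals stay non-negative
model_arima_log <- train_frame_tsbl %>%
  filter(split == 2) %>%
  model(arima_log = ARIMA(log(intensity + 1)))
model_arima_log %>%
  forecast(h = n_ahead) %>%
  hilo(level = 95)
```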
Average Accuracy in splits for each model
fable_frame %>% fabletools::accuracy(test_frame_tsbl) %>%
group_by(.model) %>%
summarise(avg_mae=mean(MAE, na.rm = FALSE),avg_rmse=mean(RMSE,na.rm = FALSE))
## # A tibble: 5 x 3
## .model avg_mae avg_rmse
## <chr> <dbl> <dbl>
## 1 arima 176. 221.
## 2 arima_manuel 180. 225.
## 3 ets 228. 280.
## 4 lm 188. 228.
## 5 prophet 193. 243.
After summarizing the test accuracies (MAE and RMSE) of each model, it is clear that the auto arima model has the lowest error, followed by the manually defined seasonal ARIMA.
In this part, four machine learning algorithms (glmnet, random forest, XGBoost and neural networks) will be studied to predict the traffic intensity in district-14.
Before applying any machine learning algorithm, a recipe should be defined first. Machine learning models are more complex than univariate models (such as ARIMA), so their implementation generally requires a set of steps. The first step is to create a preprocessing recipe, which defines how the features are derived from the training series. Second, each model is specified together with its workflow. The fitted workflows are then organized with the modeltime_table() function, which assigns an ID to each model. Finally, the fitted workflows are calibrated on the test data and used to produce forecasts.
In the following chunk, all these steps are applied step by step for each split. At the end of each loop iteration, the models' test accuracies are calculated for that split.
# empty data frame
errors <- data.frame(matrix(ncol = 10, nrow = 0))
for (sp in 1:nrow(split_frame)) {
########## Preprocessing - creating the recipe
recipe_spec <- recipe(intensity ~ day, train_frame_tsbl %>% filter(split==sp)) %>%
step_timeseries_signature(day) %>%
# step_rm is to remove features we are not going to use
step_rm(contains("am.pm"), contains("minute"),contains("year"),contains("month"),
contains("week"), contains("second"), contains("xts"), contains("iso")
) %>%
step_dummy(all_nominal()) %>%
step_fourier(day, K=1, period = 24) %>%
step_normalize(day_hour12)
########## models and workflows
# glmnet
model_spec_glmnet <- linear_reg(penalty = 0.01, mixture = 0.5) %>%
set_engine("glmnet")
workflow_fit_glmnet <- workflow() %>%
add_model(model_spec_glmnet) %>%
add_recipe(recipe_spec %>% step_rm(day)) %>%
fit(train_frame_tsbl %>% filter(split==sp))
# randomForest
model_spec_rf <- rand_forest(trees = 200) %>%
set_engine("randomForest")
workflow_fit_rf <- workflow() %>%
add_model(model_spec_rf) %>%
add_recipe(recipe_spec %>% step_rm(day)) %>%
fit(train_frame_tsbl %>% filter(split==sp))
# XGBoost
model_spec_xgboost <- boost_tree() %>%
set_engine("xgboost")
wflw_fit_xgboost <- workflow() %>%
add_model(model_spec_xgboost) %>%
add_recipe(recipe_spec %>% step_rm(day)) %>%
fit(train_frame_tsbl %>% filter(split==sp))
# NNETAR
model_spec_nnetar <- nnetar_reg(seasonal_period = 24) %>%
set_engine("nnetar")
wflw_fit_nnetar <- workflow() %>%
add_model(model_spec_nnetar) %>%
add_recipe(recipe_spec) %>%
fit(train_frame_tsbl %>% filter(split==sp))
############## defining the model table
model_table <- modeltime_table(
workflow_fit_glmnet,
workflow_fit_rf,
wflw_fit_xgboost,
wflw_fit_nnetar)
############## calibration table
calibration_table <- model_table %>%
modeltime_calibrate(test_frame %>% filter(split==sp))
errors <- rbind(errors, cbind(calibration_table %>%modeltime_accuracy(),split = sp))
}
errors %>%
arrange(.model_id)
## .model_id .model_desc .type mae mape mase smape
## 1 1 GLMNET Test 231.40885 72.715378 1.1252439 37.536556
## 2 1 GLMNET Test 192.35280 42.624121 1.0724476 29.590428
## 3 1 GLMNET Test 173.13476 45.424144 0.8989953 27.548745
## 4 1 GLMNET Test 172.93223 23.686182 1.3267298 23.061434
## 5 1 GLMNET Test 202.97957 32.064433 1.4380748 33.072783
## 6 1 GLMNET Test 215.96430 76.395879 1.0614197 37.184388
## 7 2 RANDOMFOREST Test 71.63340 17.329244 0.3483231 14.344288
## 8 2 RANDOMFOREST Test 51.56563 21.384201 0.2875000 15.000600
## 9 2 RANDOMFOREST Test 65.06037 12.449286 0.3378234 10.597963
## 10 2 RANDOMFOREST Test 97.22560 16.494320 0.7459113 14.558848
## 11 2 RANDOMFOREST Test 56.15521 9.958338 0.3978498 9.533621
## 12 2 RANDOMFOREST Test 67.72287 19.086719 0.3328438 14.756105
## 13 3 XGBOOST Test 58.48969 6.207133 0.2844108 7.085307
## 14 3 XGBOOST Test 46.36161 6.005814 0.2584854 6.039560
## 15 3 XGBOOST Test 69.40782 7.381530 0.3603973 7.705369
## 16 3 XGBOOST Test 60.80116 6.993087 0.4664643 6.738997
## 17 3 XGBOOST Test 49.41044 6.492984 0.3500643 6.748933
## 18 3 XGBOOST Test 57.07401 9.258590 0.2805069 8.613378
## 19 4 NNAR(1,1,10)[24] Test 74.22865 8.742027 0.3609427 8.229002
## 20 4 NNAR(1,1,10)[24] Test 39.69182 5.573578 0.2212986 5.320445
## 21 4 NNAR(1,1,10)[24] Test 43.47799 5.960915 0.2257577 5.726547
## 22 4 NNAR(1,1,10)[24] Test 61.30083 7.914343 0.4702978 7.611275
## 23 4 NNAR(1,1,10)[24] Test 50.06353 8.272888 0.3546914 8.153844
## 24 4 NNAR(1,1,10)[24] Test 44.03137 9.701279 0.2164051 8.722336
## rmse rsq split
## 1 259.86031 0.8894428 1
## 2 219.70028 0.9069804 2
## 3 208.96817 0.9234675 3
## 4 240.86526 0.8263456 4
## 5 247.63813 0.7891885 5
## 6 241.69813 0.8932117 6
## 7 87.77253 0.9873683 1
## 8 68.31983 0.9943054 2
## 9 76.58205 0.9866995 3
## 10 116.52105 0.9452047 4
## 11 68.11896 0.9684356 5
## 12 82.53661 0.9870516 6
## 13 116.83515 0.9732218 1
## 14 61.54356 0.9950747 2
## 15 91.49927 0.9810309 3
## 16 79.56342 0.9581734 4
## 17 65.72573 0.9787280 5
## 18 85.06645 0.9855021 6
## 19 98.42482 0.9814837 1
## 20 53.27243 0.9968212 2
## 21 55.25984 0.9928755 3
## 22 82.97193 0.9630818 4
## 23 62.38520 0.9821516 5
## 24 64.18482 0.9903956 6
Forecast accuracy by each split
# Visualize results
errors %>%
arrange(.model_id) %>%
plot_line(
x = split,
y = mae,
facet_var = .type,
facet_nrow = 1,
color = .model_desc,
title = "Evaluation of forecast accuracy by split",
subtitle = "Mean absolute error (MAE)",
xlab = "Split"
)
According to the split accuracy plot, the glmnet model performed worst among all models, while the NNAR(1,1,10)[24] model is the most consistent across splits. In split 2, the random forest model had the lowest MAE, while its highest error occurred in split 4. XGBoost is similar to the neural network in that its error line fluctuates little across splits; however, it had its worst performance in split 3.
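The reading of the plot can be checked numerically; a small sketch, assuming the errors frame built in the loop above:

```r
# The best model per split by MAE
errors %>%
  group_by(split) %>%
  slice_min(mae, n = 1) %>%
  select(split, .model_desc, mae)
```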
Visualizing the forecasts
calibration_table %>%
modeltime_forecast(actual_data = filter(trafficMAD_14_hr, day(day) > 15)) %>% # the first 15 days are filtered out so the forecast is clearly visible
plot_modeltime_forecast(.interactive = FALSE) + labs(title = "Forecasts",y = "")
errors %>%
group_by(.model_desc) %>%
summarise(avg_mae=mean(mae, na.rm = FALSE),avg_rmse=mean(rmse,na.rm = FALSE)) %>%
arrange(avg_mae,avg_rmse)
## # A tibble: 4 x 3
## .model_desc avg_mae avg_rmse
## <chr> <dbl> <dbl>
## 1 NNAR(1,1,10)[24] 52.1 69.4
## 2 XGBOOST 56.9 83.4
## 3 RANDOMFOREST 68.2 83.3
## 4 GLMNET 198. 236.
After comparing the mean test accuracy of each model, it is clear that the NNAR(1,1,10)[24] model is the winner, with XGBoost in second place.
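Since the top three learners are close, an ensemble could also be tried. A hedged sketch using modeltime.ensemble (loaded at the top of the report), assuming model_table and sp from the last iteration of the loop above are still in scope (sp retains its final value, i.e. the last split):

```r
# Average random forest, XGBoost and NNETAR (model IDs 2-4 in
# model_table) and score the ensemble on the corresponding test split
ensemble_fit <- model_table %>%
  filter(.model_id %in% c(2, 3, 4)) %>%
  ensemble_average(type = "mean")
modeltime_table(ensemble_fit) %>%
  modeltime_calibrate(test_frame %>% filter(split == sp)) %>%
  modeltime_accuracy()
```

An equal-weight average is the simplest choice; ensemble_weighted() would allow giving NNETAR a larger share.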
In this final project, the open traffic intensity data of Madrid city has been used. This multivariate time series raw data contained the traffic information of 368 zones of Madrid at a 15-minute frequency. First, to make the time series smoother and easier to analyze, the data was aggregated to an hourly frequency, which greatly reduced the number of observations from 1,048,575. In addition, another dataset was merged in to obtain the location information of each ID. With the district information available, the busiest area of Madrid was chosen for analysis: among the 21 districts, district-14 had the highest intensity, so the hourly time series was filtered to this district.
For the statistical forecasting part, a vector autoregression (VAR) model was applied first. Using the three most interesting variables (occupation, intensity and vmed, the speed of the cars), a VAR model was fitted and used to predict. The model selected by BIC performed best, and the forecasts for occupation and intensity were almost identical, since the two series are very similar. Therefore, the remaining statistical models focused on forecasting intensity. After the VAR model, a manually defined seasonal ARIMA was fitted to the data. Then automatic forecasting models were studied (arima, ets, tslm and prophet), with the manually defined seasonal ARIMA included to check whether it was the best choice. Except for the 5th split, auto arima selected the same model as the manually defined one. After time series cross-validation, auto arima achieved the lowest evaluation error (MAE: 175.7470, RMSE: 221.0155), with the manual arima performing similarly (MAE: 179.8309, RMSE: 224.7249).
For the machine learning part, the glmnet, random forest, XGBoost and neural network algorithms were studied with time series cross-validation, using the same splits as the statistical models. The results were considerably better than those of the statistical models; only glmnet did not perform well. The NNAR(1,1,10)[24] (MAE: 49.48689, RMSE: 65.41056) and XGBoost (MAE: 56.92412, RMSE: 83.37226) models performed very well. As a result, the neural network gave the best model performance, so this algorithm could be used to obtain the most accurate traffic intensity predictions among all the methods tried.